The Sales and Marketing team of Kira Plastinina would like to understand their customers' behavior from data they have collected over the past year. More specifically, they would like to learn the characteristics of customer groups.
The analysis will be successful if I can identify trends in groups of users and build a model that uses those trends to identify which users will bring revenue to the company.
Kira Plastinina is a Russian brand that was sold through a now-defunct chain of retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines, and Armenia.
“Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product Related Duration” represent the number of pages of each type visited by the visitor in that session and the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.
The “Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.
The value of the “Bounce Rate” feature for a web page refers to the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.
The value of the “Exit Rate” feature for a specific web page is calculated, over all pageviews of that page, as the percentage that were the last pageview of their session.
The “Page Value” feature represents the average value of a web page that a user visited before completing an e-commerce transaction.
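As a concrete illustration of the bounce-rate and exit-rate definitions above, here is a small sketch that computes both metrics for one page from a hypothetical toy pageview log (the toy `sessions` frame is invented for the example, not drawn from the actual analytics data):

```r
# Toy pageview log: session 1 views A then B; session 2 views only A;
# session 3 views B, A, B (rows are in viewing order within each session)
sessions <- data.frame(
  session = c(1, 1, 2, 3, 3, 3),
  page    = c("A", "B", "A", "B", "A", "B")
)
# Bounce rate of page A: of sessions that ENTERED at A, the share that
# viewed nothing else. The first row of each session is its entry page.
entries_A <- sessions[!duplicated(sessions$session) & sessions$page == "A", "session"]
bounced_A <- sum(table(sessions$session)[as.character(entries_A)] == 1)
bounce_rate_A <- bounced_A / length(entries_A)
# Exit rate of page A: of ALL views of A, the share that were the last
# pageview of their session. The last row of each session is its exit page.
last_views <- sessions[!duplicated(sessions$session, fromLast = TRUE), ]
exit_rate_A <- sum(last_views$page == "A") / sum(sessions$page == "A")
```

Here sessions 1 and 2 enter at A but only session 2 bounces, so the bounce rate of A is 0.5; A is viewed three times and is the exit page once (session 2), so its exit rate is 1/3.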
The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine’s Day this value takes a nonzero value between February 2 and February 12, zero before and after this range unless it is close to another special day, and its maximum value of 1 on February 8.
The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.
library('data.table')
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
customers <- fread('http://bit.ly/EcommerceCustomersDataset')
head(customers)
tail(customers)
str(customers)
## Classes 'data.table' and 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 1 2 3 ...
## $ ProductRelated_Duration: num 0 64 -1 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, ".internal.selfref")=<externalptr>
Checking for null values
colSums(is.na(customers))
## Administrative Administrative_Duration Informational
## 14 14 14
## Informational_Duration ProductRelated ProductRelated_Duration
## 14 14 14
## BounceRates ExitRates PageValues
## 14 14 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
14 rows with null values were found across multiple columns
customers[is.na(Administrative)]
The null values occur in the same rows, so those rows will be eliminated
dim(customers)
## [1] 12330 18
dim(na.omit(customers))
## [1] 12316 18
customers <- na.omit(customers)
Checking for duplicates
dim(customers[duplicated(customers)])
## [1] 117 18
117 duplicated rows were found and will be eliminated
customers <- customers[!duplicated(customers)]
Getting a summary of the dataset
summary(customers)
## Administrative Administrative_Duration Informational
## Min. : 0.00 Min. : -1.00 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.00 Median : 9.00 Median : 0.0000
## Mean : 2.34 Mean : 81.68 Mean : 0.5088
## 3rd Qu.: 4.00 3rd Qu.: 94.75 3rd Qu.: 0.0000
## Max. :27.00 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : -1.00 Min. : 0.00 Min. : -1.0
## 1st Qu.: 0.00 1st Qu.: 8.00 1st Qu.: 193.6
## Median : 0.00 Median : 18.00 Median : 609.5
## Mean : 34.84 Mean : 32.06 Mean : 1207.5
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1477.6
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.00000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.01422 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.00293 Median :0.02500 Median : 0.000 Median :0.00000
## Mean :0.02045 Mean :0.04150 Mean : 5.952 Mean :0.06197
## 3rd Qu.:0.01667 3rd Qu.:0.04848 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.20000 Max. :0.20000 Max. :361.764 Max. :1.00000
## Month OperatingSystems Browser Region
## Length:12199 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.358 Mean :3.153
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
## TrafficType VisitorType Weekend Revenue
## Min. : 1.000 Length:12199 Mode :logical Mode :logical
## 1st Qu.: 2.000 Class :character FALSE:9343 FALSE:10291
## Median : 2.000 Mode :character TRUE :2856 TRUE :1908
## Mean : 4.075
## 3rd Qu.: 4.000
## Max. :20.000
Assigning column values to variables for ease of use
admin <-customers$Administrative
admin_duration <-customers$Administrative_Duration
info <-customers$Informational
info_duration <-customers$Informational_Duration
product <-customers$ProductRelated
product_related_duration<-customers$ProductRelated_Duration
bounce_rates <-customers$BounceRates
exit_rates <-customers$ExitRates
page_values <-customers$PageValues
special_day <-customers$SpecialDay
month <-customers$Month
os <-customers$OperatingSystems
browser <-customers$Browser
region <-customers$Region
traffic_type <-customers$TrafficType
visitor_type <-customers$VisitorType
weekend <-customers$Weekend
revenue <-customers$Revenue
Checking for Outliers in the dataset
out <- boxplot(admin)$out
print(paste(length(out),'outliers'))
## [1] "404 outliers"
out <- boxplot(admin_duration)$out
print(paste(length(out),'outliers'))
## [1] "1147 outliers"
out <- boxplot(info)$out
print(paste(length(out),'outliers'))
## [1] "2630 outliers"
out <- boxplot(info_duration)$out
print(paste(length(out),'outliers'))
## [1] "2437 outliers"
out <- boxplot(product)$out
print(paste(length(out),'outliers'))
## [1] "1007 outliers"
out <- boxplot(product_related_duration)$out
print(paste(length(out),'outliers'))
## [1] "951 outliers"
out <- boxplot(bounce_rates)$out
print(paste(length(out),'outliers'))
## [1] "1431 outliers"
out <- boxplot(exit_rates)$out
print(paste(length(out),'outliers'))
## [1] "1327 outliers"
out <- boxplot(page_values)$out
print(paste(length(out),'outliers'))
## [1] "2730 outliers"
out <- boxplot(special_day)$out
print(paste(length(out),'outliers'))
## [1] "1249 outliers"
out <- boxplot(os)$out
print(paste(length(out),'outliers'))
## [1] "107 outliers"
out <- boxplot(browser)$out
print(paste(length(out),'outliers'))
## [1] "4321 outliers"
out <- boxplot(region)$out
print(paste(length(out),'outliers'))
## [1] "505 outliers"
out <- boxplot(traffic_type)$out
print(paste(length(out),'outliers'))
## [1] "2083 outliers"
While there were many outliers in the dataset, removing them would distort the data: the users are unique visitors from different countries with different norms, so eliminating these points would bias the analysis against users outside the ‘normal’ range
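The repeated boxplot() calls above can be condensed into a single pass. This is a sketch of the same 1.5 × IQR rule that boxplot() uses, shown on a toy data frame for reproducibility; `count_outliers` is my own helper (an assumption, not part of the pipeline above), and the same sapply() call can be pointed at the numeric columns of `customers`:

```r
# Count values outside the 1.5*IQR whiskers for one numeric vector
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
# Toy frame: column a has one extreme value, column b has none
toy <- data.frame(a = c(rep(1, 9), 100), b = 1:10)
sapply(toy, count_outliers)
```

On the toy frame this reports 1 outlier in `a` (the value 100) and 0 in `b`, without drawing a boxplot per variable.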
get.mode <- function(v){
uniq <- unique(v)
# gets all the unique values in the column
# match (v, uniq) matches a value to the unique values and returns the index
# tabulate (match (v, uniq)) takes the values in uniq and counts the number of times each integer occurs in it.
# which.max() gets the index of the first maximum in the tabulated list
# then prints out the uniq value
uniq[ which.max (tabulate (match (v, uniq)))]
}
# lazy method of performing univariate analysis
univar <- function (column,dataset,plot='hist'){
subject <- dataset[[column]] # Get the subject of the analysis
# Check the plot specified if none specified default is a histogram
if (plot == 'category'){
# For categorical data it is best to plot barplots for each category
ggplot(dataset,aes(subject))+ geom_bar(fill='#222222')
}else{ #if not categorical continue here
if (plot == 'hist') { # if histogram plot a histogram
plt <- geom_histogram(fill = "#222222", colour = "#038b8d")
}
else if (plot == 'density'){ # if density plot a density plot
plt <- geom_density()
}
print(paste('mean : ',mean(subject)))
print(paste('mode : ',get.mode(subject)))
print(paste('median : ',median(subject) ))
print(paste('max : ', max(subject),' min :', min(subject)))
print(paste('quantile 5%: ', quantile(subject,probs=c(0.05)),'quantile 95%: ', quantile(subject,probs=c(0.95)) ))
print(paste('standard deviation :', sd(subject)))
ggplot(dataset,aes(subject)) + plt
}
}
univar('Revenue',customers,'category')
Most users did not bring revenue
univar('Administrative',customers,'category')
Most users did not visit administrative pages
univar('Administrative_Duration',customers)
## [1] "mean : 81.6821434502838"
## [1] "mode : 0"
## [1] "median : 9"
## [1] "max : 3398.75 min : -1"
## [1] "quantile 5%: 0 quantile 95%: 352.23428568"
## [1] "standard deviation : 177.528167792632"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Time spent on administrative pages ranged from -1 to 3398.75 minutes, with most users spending between 0 and 352 minutes. The average time on administrative pages was 81.68 minutes, but since most users did not visit administrative pages at all, many spent 0 minutes there. The graph is also skewed to the right.
univar('Informational',customers,'category')
Most users did not visit informational pages
univar('Informational_Duration',customers)
## [1] "mean : 34.8373363758361"
## [1] "mode : 0"
## [1] "median : 0"
## [1] "max : 2549.375 min : -1"
## [1] "quantile 5%: 0 quantile 95%: 199"
## [1] "standard deviation : 141.458498734393"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Time spent on informational pages ranged from -1 to 2549 minutes, with most users spending between 0 and 199 minutes. The average time on informational pages was 34.84 minutes, but since most users did not visit informational pages, many spent 0 minutes there. The graph is also skewed to the right, meaning most users fall in the lower range.
univar('ProductRelated',customers)
## [1] "mean : 32.0584474137224"
## [1] "mode : 1"
## [1] "median : 18"
## [1] "max : 705 min : 0"
## [1] "quantile 5%: 2 quantile 95%: 110"
## [1] "standard deviation : 44.6009113793394"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The average number of product pages visited was 32, with the most common count being 1 page. The number of pages visited ranged from 0 to 705, and most users visited between 2 and 110 pages. The graph is also skewed to the right, meaning most users fall in the lower range.
univar('ProductRelated_Duration',customers)
## [1] "mean : 1207.50818855772"
## [1] "mode : 0"
## [1] "median : 609.5416667"
## [1] "max : 63973.52223 min : -1"
## [1] "quantile 5%: 0 quantile 95%: 4313.4514098"
## [1] "standard deviation : 1919.92747174039"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Time spent on product-related pages ranged from -1 to 63973.52 minutes, with most users spending between 0 and 4313.45 minutes. The average time on product-related pages was 1207.51 minutes, but since many users visited only one product page, many spent close to 0 minutes there. The graph is also skewed to the right.
univar('BounceRates',customers)
## [1] "mean : 0.0204467350769735"
## [1] "mode : 0"
## [1] "median : 0.002930403"
## [1] "max : 0.2 min : 0"
## [1] "quantile 5%: 0 quantile 95%: 0.15"
## [1] "standard deviation : 0.0454025027384669"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bounce rates ranged from 0 to 0.2, with most users between 0 and 0.15. The average bounce rate was 0.02, and the most common value was 0. The graph is skewed to the right.
univar('ExitRates',customers)
## [1] "mean : 0.0414967835993114"
## [1] "mode : 0.2"
## [1] "median : 0.025"
## [1] "max : 0.2 min : 0"
## [1] "quantile 5%: 0.004545455 quantile 95%: 0.175"
## [1] "standard deviation : 0.046247160452655"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Exit rates ranged from 0 to 0.2, with most users between 0.0045 and 0.175. The average exit rate was 0.0415, and the most common value was 0.2. The graph is skewed to the right.
univar('PageValues',customers)
## [1] "mean : 5.95250015960423"
## [1] "mode : 0"
## [1] "median : 0"
## [1] "max : 361.7637419 min : 0"
## [1] "quantile 5%: 0 quantile 95%: 38.311637762"
## [1] "standard deviation : 18.6577915027872"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most users had a page value of 0, while the average page value was 5.95. Page values ranged from 0 to 361.76. The graph is heavily skewed to the right, meaning most users fall in the lower range.
univar('SpecialDay',customers,'category')
Most people did not visit the site close to special days
univar('OperatingSystems',customers,'category')
Most users visiting the site used the operating system encoded as 2; operating systems 1, 2 and 3 were the most commonly used.
univar('Browser',customers,'category')
Browser 2 was the most commonly used, though browser 1 was also popular.
univar('Region',customers,'category')
Most users were from regions 1 and 3 respectively; region 5 had the fewest users.
univar('TrafficType',customers,'category')
Most users had traffic types 2, 1 and 3 respectively, with few users on traffic types 16, 17 and 18.
ggplot(customers,aes(admin,fill=revenue))+ geom_bar()
A few users with a low count of administrative pages led to revenue for the site, while most users with a count of 0 led to no revenue
ggplot(customers,aes(admin_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A few users with a short duration on administrative pages led to revenue for the site, while most users with a duration of 0 led to no revenue
ggplot(customers,aes(info,fill=revenue))+ geom_bar()
A few users with a low count of informational pages led to revenue for the site, while most users with a count of 0 led to no revenue
ggplot(customers,aes(info_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A few users with a short duration on informational pages led to revenue for the site, while most users with a duration of 0 led to no revenue.
ggplot(customers,aes(product,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A few users with a low count of product pages led to revenue for the site, while most users led to no revenue
ggplot(customers,aes(product_related_duration,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A few users with a short duration on product pages led to revenue for the site, while most users led to no revenue
ggplot(customers,aes(bounce_rates,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Users with higher bounce rates tended not to lead to revenue, while a high number of users with low bounce rates led to revenue
ggplot(customers,aes(exit_rates,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Users with higher exit rates tended not to lead to revenue, while a high number of users with low exit rates led to revenue
ggplot(customers,aes(page_values,fill=revenue))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most users with a low page value led to no revenue, but as page values increased users were more likely to lead to revenue
ggplot(customers,aes(special_day,fill=revenue))+ geom_bar()
Fewer people visited the site close to special days; most revenue came on days not close to a special day
ggplot(customers,aes(os,fill=revenue))+ geom_bar()
ggplot(customers,aes(browser,fill=revenue))+ geom_bar()
Since most users used browser 2, more revenue-generating users came from it than from other browsers
ggplot(customers,aes(region,fill=revenue))+ geom_bar()
Users from region 3 were the second most numerous but had a better ratio of revenue to non-revenue users than region 1
ggplot(customers,aes(traffic_type,fill=revenue))+ geom_bar()
Users with traffic types 1 to 4 included the most revenue customers, with traffic type 4 having the best ratio of revenue to non-revenue users
ggplot(customers,aes(admin_duration,info_duration))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Users with less administrative duration tended to spend more time on informational pages; as administrative duration increased, time spent on informational pages decreased.
ggplot(customers,aes(admin_duration,info_duration,color=revenue))+ geom_point(alpha=0.5)
Users who led to revenue had lower durations on administrative pages
ggplot(customers,aes(admin_duration,product_related_duration))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
As administrative duration increases users tend to spend more time on product pages.
ggplot(customers,aes(admin_duration,product_related_duration,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Users leading to revenue had a good blend of time on both administrative and product pages
ggplot(customers,aes(admin_duration,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Most users had really low bounce rates and durations in administrative pages
ggplot(customers,aes(admin_duration,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Users who led to revenue showed increasing bounce rates as their duration on administrative pages increased, while users leading to no revenue showed decreasing bounce rates as their duration on administrative pages increased
ggplot(customers,aes(admin_duration,exit_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Exit rates decreased as user duration in administrative pages increased
ggplot(customers,aes(admin_duration,exit_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Exit rates for users leading to no revenue decreased as their duration on administrative pages increased, while the exit rates of those leading to revenue remained fairly constant
ggplot(customers,aes(admin_duration,page_values))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Users had fairly low page values and durations in administrative pages
ggplot(customers,aes(admin_duration,page_values,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Page values of users leading to revenue tended to decrease as their duration on administrative pages increased, while those of users not leading to revenue remained very low
ggplot(customers,aes(exit_rates,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
The general trend is as bounce rates increase so do exit rates
ggplot(customers,aes(exit_rates,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
While bounce rates increased with exit rates for both groups, customers leading to revenue showed a slower rate of increase.
ggplot(customers,aes(product_related_duration,bounce_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Bounce rates tend to increase with an increase in product related duration.
ggplot(customers,aes(product_related_duration,bounce_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Bounce rates tend to increase with an increase in product related duration, for both users leading to revenue and not.
ggplot(customers,aes(product_related_duration,exit_rates))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
Exit rates tend to decrease with an increase in product related duration.
ggplot(customers,aes(product_related_duration,exit_rates,color=revenue))+ geom_point(alpha=0.5)+ geom_quantile(size=0.9 ,alpha = 1,quantiles=0.50)
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
Exit rates tend to remain fairly constant for users leading to revenue while the others tend to have exit rates dropping with increase in product related duration
label_encode <- function (column,dataframe){
data <- dataframe[[column]]
new_data <- as.integer(factor(data))
# print(new_data)
return (new_data)
}
users <- customers
users$Month <- label_encode('Month',users)
users$VisitorType <- label_encode('VisitorType',users)
users$Weekend <- label_encode('Weekend',users)
users$Revenue <- label_encode('Revenue',users)
get_lower_tri<-function(cor_mat){
cor_mat[upper.tri(cor_mat)] <- NA
return(cor_mat)
}
cor_mat <- round(cor(users),4)
lower <- get_lower_tri(cor_mat)
melted <- melt(lower, na.rm = TRUE)
## Warning in melt(lower, na.rm = TRUE): The melt generic in data.table has
## been passed a matrix and will attempt to redirect to the relevant reshape2
## method; please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(lower). In the next version, this warning will become an error.
head(melted)
ggplot(melted, aes(Var1, Var2))+ geom_tile(aes(fill = value), colour = "white") +
scale_fill_gradient2(low = "#222222", high = "#1abc9c", mid = "#222222")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library('ClusterR')
## Loading required package: gtools
library('caret')
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
get_train_validation <- function (dataset){
# Getting the row numbers for the train sample (80% of the dataset)
train <- sample(c(1:nrow(dataset)),size = ceiling(0.80*nrow(dataset)),replace = FALSE)
# training set == part of the dataset in the train sample
train_set <- dataset[train,]
# Validation set == part of the dataset not in the train sample
Validation_set <- dataset[-train,]
# fix for R not accepting multiple argument returns
sets <- list("Train" = train_set, "Validation" = Validation_set)
return (sets)
}
normalize <- function(x){
return ((x-min(x)) / (max(x)-min(x)))
}
users.norm <- users
for (name in colnames(users)){
users.norm[[name]] <- normalize(users[[name]])
}
head(users.norm)
I decided to use PCA to reduce the number of columns
userd <- prcomp(users.norm[, -18], scale. = T) # drop column 18 (Revenue) before PCA; a bare [c(-18)] on a data.table drops a row, not a column
screeplot(userd, type = "l", npcs = 15, main = "Screeplot of the first 15 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
cumpro <- cumsum(userd$sdev^2 / sum(userd$sdev^2))
plot(cumpro[0:15], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.88759, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC6"),
col=c("blue"), lty=5, cex=0.6)
PCA can effectively reduce the dimensionality from 17 features to 6 principal components
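A quicker way to read off this kind of cut-off is from summary(prcomp), whose importance matrix has a "Cumulative Proportion" row. A small sketch on the built-in iris measurements (used here only so the example is reproducible, not on the customer data):

```r
# PCA on the four numeric iris columns, standardized as above
pc <- prcomp(iris[, 1:4], scale. = TRUE)
# Row 3 of the importance matrix is the cumulative proportion of variance
cumvar <- summary(pc)$importance["Cumulative Proportion", ]
# Smallest number of components explaining at least 88% of the variance
k <- which(cumvar >= 0.88)[1]
```

For iris, PC1 explains about 73% and PC1+PC2 about 96% of the variance, so the 88% threshold is reached at k = 2; the same one-liner applied to `userd` reproduces the cut-off at PC6.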
p_ca <- function (dataset){
sets <- get_train_validation(dataset)
pca.train <-sets$Train[,-c(18)]
pca.test <-sets$Validation[,-c(18)]
prin_comp <- prcomp(pca.train, scale. = T)
train.data <- data.frame(Revenue = sets$Train[[18]],prin_comp$x)
train.data <- train.data[,1:7]
test.data <- predict(prin_comp, newdata = pca.test)
test.data <- as.data.frame(test.data)
test.data$PC7 <-sets$Validation[[18]]
#test.data <- bind_cols(test.data, sets$Validation[[18]])
test.data <- test.data[,c(1:7)]
sets <- list("Train" = train.data, "Test" = test.data)
return (sets)
}
k_means <- function (train_set, Validation_set,n_start,max_iter,algorthm){
# getting the sets
x <- ncol(train_set)
x1 <- x-1
train_x <- train_set[,2:x]
train_y <- train_set[[1]]
validation_x <- Validation_set[,1:x1]
validation_y <- Validation_set[[x]]
model <- kmeans(train_x,2,nstart = n_start,iter.max = max_iter, algorithm=algorthm)
pred <- predict_KMeans(validation_x, model$centers, threads = 1)
# Relabel so the majority cluster maps to 0 (comparing cluster sizes,
# not the raw label vectors, which would recycle to a constant TRUE)
if (length(pred[pred==2]) > length(pred[pred==1])){
pred[pred==2] <- 0
}else{
pred[pred==1] <- 0
pred[pred==2] <- 1
}
t <- table(validation_y, pred)
accur <- cm_2x2_accuracy(t)
hyper <- list('nstart :', n_start,'iter.max :', max_iter,'algorithm :',algorthm)
resul <- list("model" = model,'table'= t, "Accuracy" = accur, 'Hyper_parameters' = hyper)
return (resul)
}
cm_2x2_accuracy <- function (t){
tp <- t[[1]]
fp <- t[[2]]
fn <- t[[3]]
tn <- t[[4]]
accur <- (tp+tn) / (tp+fp+fn+tn)
return (accur)
}
wssplot <- function(data, nc=15, seed=123){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of groups",
ylab="Sum of squares within a group")}
wssplot(users[, -18], nc = 10) # drop column 18 (Revenue), not row 18
The within-group sum of squares fluctuates, with elbows mostly at 2 and 6 clusters. Since the gradient shift starts at 2, that is the number of clusters I will use.
Since the target is either true or false, the goal is to let the model come up with 2 clusters
options(warn=-1)
accuracy <- 0
algorithm <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
start.n<- c(2 : 17)
iter.max <- c(10,15,20,25,30,35,40,45,50,55,60,65,70)
for (algo in algorithm){
for (start in start.n){
for (iter in iter.max){
# Getting different train and validation sets on each run
sets <-suppressWarnings(p_ca(users.norm))
results <- suppressWarnings(k_means(sets$Train,sets$Test,n_start = start,max_iter = iter,algorthm = algo ))
if (results$Accuracy > accuracy){
best_table <- results$table
accuracy <- results$Accuracy
best_model <- results$model
best_hypers <- results$Hyper_parameters
}
}
}
}
print(list('Best Accuracy :',accuracy*100,'With Hyperparameters at:',best_hypers))
## [[1]]
## [1] "Best Accuracy :"
##
## [[2]]
## [1] 80.3608
##
## [[3]]
## [1] "With Hyperparameters at:"
##
## [[4]]
## [[4]][[1]]
## [1] "nstart :"
##
## [[4]][[2]]
## [1] 4
##
## [[4]][[3]]
## [1] "iter.max :"
##
## [[4]][[4]]
## [1] 10
##
## [[4]][[5]]
## [1] "algorithm :"
##
## [[4]][[6]]
## [1] "Hartigan-Wong"
confusionMatrix(best_table)
## Confusion Matrix and Statistics
##
## pred
## validation_y 0 1
## 0 1858 235
## 1 244 102
##
## Accuracy : 0.8036
## 95% CI : (0.7873, 0.8192)
## No Information Rate : 0.8618
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.1845
##
## Mcnemar's Test P-Value : 0.7147
##
## Sensitivity : 0.8839
## Specificity : 0.3027
## Pos Pred Value : 0.8877
## Neg Pred Value : 0.2948
## Prevalence : 0.8618
## Detection Rate : 0.7618
## Detection Prevalence : 0.8581
## Balanced Accuracy : 0.5933
##
## 'Positive' Class : 0
##
The k-means model achieved 80.36% accuracy.
The model did not perform as well as I had hoped: it produced far more false negatives than true negatives
users.scald <- users
d <- dist(users.scald[,-c(18)])
h_clust <- hclust(d, method = "ward.D2")
plot(h_clust)
Selecting the number of clusters is a choice made by the user after building the tree. There are 2 main clusters in the dendrogram, with the left branch being the more dominant. For 3 clusters, the left main branch and the two branches of the right main branch would form the clusters; for 4 clusters, the left main branch and the three branches of the right main branch would form the clusters
But for comparison purposes I will use the same number of clusters as in the k-means, which also makes it easier to evaluate the performance of the tree
groups <- cutree(h_clust,k=2)
h_users <- users.scald
h_users$Revenue <- groups
table(groups)
## groups
## 1 2
## 10256 1943
There is no ‘predict’ method for objects of class “hclust”, so the only way to check the performance of the hierarchical tree is to train a supervised model on the cluster labels and use it to predict the outputs. However, that adds another layer of uncertainty: the tree could be perfect while the model achieves 89% accuracy, or the tree could be imperfect while the model achieves 98%+ accuracy. Either way, the result is weighed down by more than one layer of uncertainty
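A lighter alternative, sketched below as an assumption rather than as part of the pipeline here, is to assign each held-out point to the nearest cluster centroid of the training clustering; `predict_hclust` is a hypothetical helper of my own, not a method of the "hclust" class:

```r
# Assign each row of new_x to the training cluster with the closest centroid.
# train_x: numeric matrix of training rows; groups: cluster labels from cutree()
predict_hclust <- function(train_x, groups, new_x) {
  # One centroid per cluster: column means computed within each group
  centroids <- apply(train_x, 2, function(col) tapply(col, groups, mean))
  # For each new row, pick the centroid with the smallest squared distance
  apply(new_x, 1, function(row) {
    which.min(colSums((t(centroids) - row)^2))
  })
}
```

This sidesteps the second supervised model entirely, at the cost of assuming clusters are reasonably centroid-shaped (the same assumption k-means makes).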
library('class')
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
k_nn <- function (train_,test_,K=21){
# getting the sets
x <- ncol(train_)
x1 <- x-1
train_x <- train_[,2:x]
train_y <- train_[[1]]
validation_x <- test_[,1:x1]
validation_y <- test_[,x]
val_len <- length(validation_y[validation_y==1])
pred <- knn(train = train_x, test = validation_x,
cl = train_y, k = K)
t <- table(validation_y,pred)
accur <- cm_2x2_accuracy(t)
if ((length(pred[pred==1]))<(val_len/2)){
accur <- 0
}# Eliminating a couple of imbalanced models
hyper <- list('k :', K)
resul <- list('table'= t,"Accuracy" = accur, 'Hyper_parameters' = hyper)
return (resul)
}
distances <- c('euclidean', 'maximum','manhattan','canberra','binary','minkowski')
methods <- c('ward.D2','single', 'median', 'average', 'centroid', 'ward.D', 'mcquitty')
accuracy <- 0
for (dist in distances){
print(dist)
for (meth in methods){
for (i in c(5,10,15,20,25,30)){
sets <- p_ca(users.norm)
train <- sets$Train
valid <- sets$Test
d <- dist(train[,-c(1)],method=dist)
h_clust <- hclust(d, method = meth)
groups <- cutree(h_clust,k=2)
if (length(groups[groups==2]) > length(groups[groups==1])){
groups[groups==2] <- 0
}else{
groups[groups==1] <- 0
groups[groups==2] <- 1
}
h_users <- train
h_users$Revenue <- groups
results <- k_nn(h_users,valid,i)
if (results$Accuracy > accuracy){
best_table <- results$table
accuracy <- results$Accuracy
best_hypers <- results$Hyper_parameters
best_tree <- h_clust
best_method <- meth
best_distance <- dist
}
}
}
}
## [1] "euclidean"
## [1] "maximum"
## [1] "manhattan"
## [1] "canberra"
## [1] "binary"
## [1] "minkowski"
print(list('Best Accuracy :',accuracy*100,'With Hyperparameters at:',best_hypers))
## [[1]]
## [1] "Best Accuracy :"
##
## [[2]]
## [1] 81.42681
##
## [[3]]
## [1] "With Hyperparameters at:"
##
## [[4]]
## [[4]][[1]]
## [1] "k :"
##
## [[4]][[2]]
## [1] 30
confusionMatrix(best_table)
## Confusion Matrix and Statistics
##
## pred
## validation_y 0 1
## 0 1919 130
## 1 323 67
##
## Accuracy : 0.8143
## 95% CI : (0.7983, 0.8295)
## No Information Rate : 0.9192
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1355
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8559
## Specificity : 0.3401
## Pos Pred Value : 0.9366
## Neg Pred Value : 0.1718
## Prevalence : 0.9192
## Detection Rate : 0.7868
## Detection Prevalence : 0.8401
## Balanced Accuracy : 0.5980
##
## 'Positive' Class : 0
##
The hierarchical model performed well, with 81.43% accuracy.
The model classified the users well.
It was the better of the two models run here
Clusters in k-means don't overlap, while in hierarchical clustering the clusters are nested
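That nesting property can be checked directly: cutting the same dendrogram at two different values of k, every fine cluster must fall wholly inside one coarse cluster, which is not guaranteed when running kmeans() twice with different k. A small sketch on the built-in iris data (used only for reproducibility):

```r
# Build one tree, then cut it at k = 2 and k = 4
hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
g2 <- cutree(hc, k = 2)
g4 <- cutree(hc, k = 4)
# Nested means each k = 4 cluster maps to exactly one k = 2 cluster
nested <- all(tapply(g2, g4, function(v) length(unique(v))) == 1)
```

`nested` is TRUE by construction of the dendrogram: cutree() only ever merges or splits whole branches.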
I gauged the performance of these models and tuned them to get the best model in each category; a follow-up would be to use other classification models and compare them to see which one performs best.
While the hierarchical clustering was more accurate, I kept getting an imbalanced validation set, so it is not a reliable basis for judging whether the model can predict well
Yes, the data was appropriate for the question asked.
Yes, the question was right for the data provided.